The aim of this project is to demonstrate the application of machine learning techniques discussed in the ST310 Machine Learning module, in turn drawing primarily from the exposition in (James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani 2017) and (Hastie, T. and Tibshirani, R. and Friedman, J.H. 2009) among other texts. We examine the use of both linear techniques, e.g. Logistic Regression, and non-linear methods, e.g. Random Forests and Gradient Boosted Decision Trees, for a generic binary classification problem.
Remark: This report is best viewed as a .html file. If executing this notebooks as a .Rmd file, ensure that all libraries and dependencies are installed and that the chunks are executed in sequential order.
We obtain the dataset from the Kaggle March Tabular Playground competition. The data consists of anonymised features, which correspond to a binary outcome variable; in other words the task is a classification problem. Although the data is anonymised, the challenge description states:
The dataset used for this competition is synthetic but based on a real dataset and generated using a CTGAN. The original dataset deals with predicting the amount of an insurance claim. Although the features are anonymized, they have properties relating to real-world features.
We believe that this machine learning task is well-motivated, as predicting the amount of an insurance claim or whether an insurance claim occurred, would have a meaningful business impact in the context of actuarial pricing and risk management.
# generic functions and plotting
library(tidyverse)
library(tidyr)
library(ggplot2)
library(GGally)
library(patchwork)
library(broom)
library(data.table)
library(doParallel)
library(foreach)
# modelling
library(tidymodels)
library(glmnet)
library(randomForest)
library(xgboost)
library(catboost)
library(treesnip)
library(parsnip)
# metrics
library(car) #outliers
library(pROC)
library(pdp)
# read in data
df <- read.csv(file = "../data/train.csv")
cat(c("The dimensions of the array are :", dim(df)[1], ", ", dim(df)[2]))
The dimensions of the array are : 300000 , 32
The dataset consists of 30 features, 19 categorical variables and 11 numerical variables, and a binary outcome variable target. 73.5% of the observations have target==1, suggesting an imbalanced dataset. We will need to process these categorical variables in some manner, which we shall consider in subsequent sections.
We remove the first column id, which represents an unique identifier for each observation. We then convert the first 19 columns, which are categorical variables into the factor data type in R. Given computational limitations (the code is executed on single CPU laptops), we subsample n_train = 4000 observations from the complete dataset of 300,000 observations. For our subsequent analysis, we will use only this subsample of 4000 observations, although the approaches can be naturally extended to a larger dataset given greater computational resources. We further partition the 4000 observations, into a train set of 3000 observations and a test set of 1000 observations. All training and tuning are performed on the training data, and we will consider model performance on the test set performance scores.
Remark: Different models would require different preprocessing for the categorical variables, and thus we will further address in subsequent sections; for example the glm package can accept factors as a data type, whereas for glmnet they must be converted to the model.matrix format.
Remark: When encoding the categorical variables as dummy variables in our subsequent analysis, we find that there are over 600 unique categories across all 19 of the categorical variables. We could also consider different methods to process categorical variables, such as merging less frequently occurring categories, or target encoding (Parr, Terence and Howard, Jeremy 2019, ch. 6). Nonetheless, in our analaysis we primarily use the factors / dummies approach to encoding categorical variables.
A challenge in this case is that we have no specific domain knowledge about the meaning of particular features, and whether they may be useful. Thus in this problem, any preprocessing in terms of feature engineering, and a priori feature selection may not be that meaningful. Nonetheless, we can resort to model interpretability methods (Christoph Molnar 2019) to obtain an ex-post explanation of our models. In the linear case, we can examine coefficients and their statistical significance, and in the case of tree-based models, such as Feature Importances, Partial Dependence Plots, and Shapley values.
cat_feats = 1:19 # the first 19 columns are categorical
cont_feats <- 20:30 # and the next 11 are continuous
target_col <- 31 # target
# convert cats to factors
df <- column_to_rownames(df, var = "id") %>%
mutate_if(is.character,as.factor) %>%
mutate_at(vars(target), factor)
# subsample the data for faster model inference
set.seed(1)
n_train <- 4000
df_sample = df[sample(1:nrow(df), n_train),]
# Partition data into train and test; test will be our oos data
set.seed(1)
df_split <- initial_split(df_sample, prop = 3/4)
df_train <- training(df_split)
df_test <- testing(df_split)
We first conduct some exploratory data analysis by visualising the univariate and bivariate relationships in the dataset. This allows us to gain some understanding of the data and potential modelling approaches.
We observe that the univariate distributions of the continuous variables are all multi-modal and non-normal, but they are all normalised (scaled) to the range of \([0, 1]\).
# flatten df into using pivot_longer and plot distribution
df %>% pivot_longer(cols = starts_with("cont"), names_to = "cont") %>%
ggplot(aes(x = value))+
geom_histogram(bins = 100, alpha = 0.85)+
ggtitle("Continuous features distribution")+
facet_wrap(cont~., scales = "free") +
theme_minimal()
From the distributions of categorical variables, we see that there are variables with substantially more observations in one category, and also variables with a high number of (\(>50\)) categories, which will be an issue to address for any subsequent preprocessing.
# flatten df into using pivot_longer and plot distribution
df %>% pivot_longer(cols = contains(c("cat", "target")), names_to = "cat") %>%
ggplot(aes(x = value))+
geom_bar(alpha = 0.85)+
ggtitle("Categorical features distribution")+
facet_wrap(cat~., scales = "free", ncol = 4) +
theme_minimal(base_size = 30)
We group the continuous variables by the target and plot them as boxplots to check for any obvious differences discernible by eye. From the plots, claims with target == 1 have lower values of cont3 on average (median) than claims with target == 0. Claims with target=1 also have a much higher median value of cont4 than claims with target == 0. Hence, we would expect cont3 and cont4 to have negative relationships with target. In addition, we see a group of observations at the tail ends for cont0, cont5, cont7, cont8, cont9, cont10, which may be indicative of outliers.
# flatten df into using pivot_longer
# group by target and plot distribution
df[, c(cont_feats, target_col)] %>%
pivot_longer(cols = starts_with("cont"), names_to = "var", values_to="value") %>%
ggplot(aes(x=target,y=value), fill=factor(value)) +
geom_boxplot() + coord_flip() + facet_wrap(~var, scales="free_x")
From our conditional boxplots, we can identify potential outliers by filtering observations that lie at the extreme percentiles of those continuous variables. In particular we note that there are many values at the extreme quantiles for the variables cont0, cont5, cont7, cont8, cont9, cont10. We investigate this further using the Hampel filter, which considers points lying outside the median plus or minus 3 mean absolute deviations as outliers. Through this approach, more than 200 observations per variable would be identified as outliers, which suggests that these may not be outliers but that the distribution is just heavy-tailed. Without further information on the reasonable scale of values that individual variables can take (along with the fact that they are all normalised), we find it challenging to identify outliers through descriptive statistics and decide not to exclude any such points using this approach. We will proceed to detect outliers in another approach in subsequent modelling.
# MAD filter
hampel_filter <- function(df){
lower_bound <- median(df) - 3 * mad(df, constant = 1)
upper_bound <- median(df) + 3 * mad(df, constant = 1)
outlier_ind <- which(df < lower_bound | df > upper_bound)
return(outlier_ind)
}
# quantile filter
percentile_filter <- function(df, lq = 0.001, uq = 0.999){
lower_bound <- quantile(df, lq)
upper_bound <- quantile(df, uq)
outlier_ind <- which(df < lower_bound | df > upper_bound)
return(outlier_ind)
}
# outlier count
hampel_count <- function(x){length(hampel_filter(x))}
pct_count <- function(x){length(percentile_filter(x))}
outlier_counts <- df_train[, cont_feats] %>% map_dfr(hampel_count)
outlier_counts[2, ] <- df_train[, cont_feats] %>% map_dfr(pct_count)
outlier_counts
For categorical variables, we use stacked bar plots to show the percentages of observations in each category that correspond to target == 0 and target == 1 respectively. A much larger proportion of claims with cat13 == B appear to be associated with target == 1 compared to cat13 == A. On the other hand, a much larger percentage of claims correspond to target == 0 if cat18 is A or B, than if cat18 is C or D.
# flatten df into using pivot_longer
# group by target and plot distribution
df[, c(cat_feats, target_col)] %>%
pivot_longer(cols = starts_with("cat"), names_to = "cat", values_to="value") %>%
ggplot(aes(x = value, fill=target)) +
geom_bar(position="fill") +
scale_y_continuous(name = "Within group Percentage", labels = scales::percent) +
facet_wrap(~cat, scales="free_x", ncol = 4) +
theme_minimal(base_size = 40)
We also inspect the correlation matrix for our continuous variables. There seems to be a cluster of variables cont1, cont2, cont8 that are highly correlated with each other. This may be potentially indicative of multicollinearity, a concern when using linear models. In addition, we also consider the (Pearson) correlation and our continuous variables, although it is not necessarily meaningful in this case given our target is binary.
We note that the continuous variables have Pearson correlations of roughly \([-0.2, 0.2]\), suggesting that there is some small signal or information between the features and the target. However, the Pearson correlation only describes a linear and pairwise association between the variables. As such there could also be complex non-linear associations and interactions between the variables, which the presence of multi-modal distributions of the continuous features in our univariate EDA may also affirm.
#heatmap, high values in red, low in yellow
cor_matrix <- cor(df[, cont_feats])
heatmap(cor_matrix, main="Correlation Matrix (Clustered)")
cor_with_target <- rownames_to_column(data.frame(cor(df[, c(cont_feats)], as.numeric(df$target))))
names(cor_with_target) <- c("feature", "pearson_corr")
cor_with_target %>%
ggplot(aes(x=reorder(feature, -pearson_corr), y = pearson_corr)) +
geom_bar(stat='identity') + coord_flip() + ggtitle("Pearson Correlation of Continuous with Target")
We can use Principal Components Analysis (PCA) as a means to visualise the data in low-dimension, and examine whether there are any explicitly discernible trends. We apply PCA to all the continuous features, without further pre-scaling is required given the continuous features have values from \([0, 1]\). By eye, there do not appear to be any significant difference in the PCA representations for each class, although there are many observations with target == 1 closer to the bottom of the ellipsoid formed by the first two PCA loadings. At least in the space formed by the first two princial components, the classes do not appear to be linearly separable, - which suggest a non-linear method may be more effective on this classification task.
On examination of the cumulative explained variance ratio, the variation between the continuous features \(\mathbf{X}\) captured by \(k\) principal components, we find that the first two principal components only capture about \(\approx 60\%\) variation in the continuous variables. To capture \(\approx 90\%\) of the variation, we would need 6 of the continuous features, and for \(\approx 95\%\) we would need 9. This could suggest that including more features could be beneficial, which would be a concern if we are to do feature selection.
# pca loadings
pcs <- prcomp(df[,cont_feats])
set.seed(2021)
data.frame(pc1=pcs$x[,1], pc2=pcs$x[,2], target=df[, "target"]) %>%
ggplot(aes(x = pc1, y = pc2, colour = target)) +
geom_jitter(alpha=0.7) + ggtitle('Principal Components')
# cumulative explained variance ratio
cumul_var <- cumsum(pcs$sdev^2 / sum(pcs$sdev^2))
ggplot(data.frame(feature = 1:11, cumul_var = cumul_var)) +
geom_line(aes(x = feature,y = cumul_var)) + ggtitle("Cumulative Explained Variance Ratio") +
scale_y_continuous(labels=percent) + xlab("Number of Principal Components") +
ylab("Cumulative explained Variance Ratio")
We first consider a naive model that always predicts a single label (in this case the greater accuracy is obtained by always predicting \(0\), given it is the most frequently occurring class). The accuracy in this case, the proportion of correct labels \(\sum_{i = 1}^{n} I(y_{i} = \hat{y_{i}})\), would be \(1 - \hat{y} = 0.745\). This illustrates a potential issue with the accuracy metric - a “high” accuracy may be reflective of the distribution of the target and not the model performance itself, so when comparing model accuracies, an appropriate baseline is needed.
We also consider the ROC-AUC metric (Receiving Operator Characteristic Area Under the Curve). Given the problem is one related to insurance, the modelling outcome of interest is not only to obtain the correct predictions (accuracy), but also accurate probabilities, which the AUC metric captures to some extent.
The AUC metric calculates the area under the ROC-curve, the plot of the true positive rate or sensitivity \(\frac{TP}{TP + FN}\) against the false positive rate \(\frac{FP}{TN + FP}\), assessing the performance of the classifier’s predicted probabilities across all possible decision thresholds. A classifier that always predicts 0s would result in a baseline AUC of \(0.5\).
As we do not know the specific threshold relevant to the insurance context of the problem, we will report both accuracy, with a decision threshold set to \(> 0.5\), and the AUC score for all models considered. In practice, the decision threshold would largely depend on the requirements of the modeller.
# train-test
X_train = as.matrix(df_train[, cont_feats])
y_train = as.numeric(as.matrix(df_train$target))
X_test = as.matrix(df_test[, cont_feats])
y_test = as.numeric(as.matrix(df_test$target))
# helper function for metrics
evaluate_metrics <- function(train_pred, test_pred, train_true, test_true){
train_classes <- ifelse(train_pred > 0.5, 1,0)
test_classes <- ifelse(test_pred > 0.5, 1,0)
acc1 <- mean(train_classes == train_true)
auc1 <- auc(roc(train_true, train_pred, quiet=TRUE))
acc2 <- mean(test_classes == test_true)
auc2 <- auc(roc(test_true, test_pred, quiet=TRUE))
data.frame(train_acc=acc1, train_auc=auc1, test_acc = acc2, test_auc=auc2)
}
# metrics for naive prediction
results = data.frame(evaluate_metrics(rep(0, dim(df_train)[1]),
rep(0, dim(df_test)[1]),
df_train$target, df_test$target),
row.names=c("naive"))
results
To fulfill the project requirements, we implement a `from scratch’ Stochastic Gradient Descent routine for Logistic Regression. The derivation follows (Hastie, T. and Tibshirani, R. and Friedman, J.H. 2009, 120–26)
Consider a matrix of variables \(X = (x_{1}, x_{2}, \ldots x_{n})^{T}\), where \(x_{i}\) denote the i-th row or observation, the target \(\mathbf{y} = (y_{1}, y_{2}, \ldots, y_{n})\). Let \(\frac{p(x_{i}; \beta)}{1 - p(x_{i} ; \beta)} = exp(\beta^{T} x_{i})\) denote the odds ratio.
We are interested in finding the coefficients \(\beta\), such that the logistic loss (also referred to as binary cross-entropy, and equivalent to the negative log-likelihood) is minimised. The logistic loss is given by:
\[\begin{align}l(\boldsymbol{\beta}) &= -\sum_{i = 1}^{N} y_{i} log(p(x_{i} ; \boldsymbol{\beta})) + (1 - y_{i}) log(1 - p(x_{i} ; \boldsymbol{\beta}))\\ &= \sum_{i = 1}^{N} \left [ y_{i}log \left (\frac{p(x_{i} ; \boldsymbol{\beta})}{1 - p(x_{i} ; \boldsymbol{\beta})} \right) + log(1 - p(x_{i} ; \boldsymbol{\beta})) \right ]\\ l(\boldsymbol{\beta}) &= -\sum_{i = 1}^{N}\left [y_{i} \boldsymbol{\beta}^{T} x_{i} - log(1 + exp(\boldsymbol{\beta}^{T}x_{i})) \right ] \tag{1} \end{align}\]
With the inclusion of a regularisation term, specifically a \(L^{2}\) penalty, the loss function becomes:
\[ \begin{equation}l(\boldsymbol{\beta}) = -\sum_{i = 1}^{N} \left [y_{i} \boldsymbol{\beta}^{T} x_{i} - log(1 + exp(\boldsymbol{\beta}^{T}x_{i})) \right ] - \lambda \boldsymbol{\beta}^{T} \boldsymbol{\beta} \tag{2} \end{equation}\]
The gradient of the loss function with respect to the coefficients \(\mathbf{\beta}\) is given by:
\[\nabla(\boldsymbol{\beta}) = -\sum_{i = 1}^{N} \left [ y_{i} x_{i} - \frac{x_{i}exp(\boldsymbol{\beta}^{T} x_{i})}{1 + exp(\boldsymbol{\beta}^{T} x_{i})} \right] - \lambda 2\boldsymbol{\beta}\]
We apply (stochastic) gradient descent. In short, this relies on updating the coefficients \(\beta\) iteratively based on a step size \(\lambda\):
\[\beta_{t + 1} = \beta_{t} - \lambda \nabla(\boldsymbol{\beta}_{t})\]
More sophisticated approaches such as Momentum or Adam can be used for the iterative updates; these are better outlined in (Nisheeth K. Vishnoi 2020). In our simple implementation, we use the Barzilai-Borwein method (Murphy, Kevin P 2012, 444–45) to determine the step size.
\[\lambda_{t} = \frac{|(\beta_{t} - \beta_{t - 1})^{T}(\nabla F(\beta_{t}) - \nabla F(\beta_{t - 1}))|}{\| \nabla F(\beta_{t}) - \nabla F(\beta_{t - 1}))\|^{2}}\] Given sufficient iterations and appropriate step size, the model should converge (i.e. the change in the coefficients \(\|\beta_{t + 1} - \beta_{t}\|\) or the change in loss is less than some threshold \(\epsilon\)), although this is not guaranteed.
# binary crossentropy / log-loss
log_loss <- function(x, y, betas, lambda){
logits <- x %*% betas
- (t(y) %*% logits - sum(log(1 + exp(logits))) + lambda * t(betas) %*% betas) / dim(x)[1]
}
# logistic regression gradients
gradients <- function(x, y, betas, lambda){
logits <- x %*% betas
- (t(x) %*% (y - exp(logits)/(1 + exp(logits)))) - lambda *2 * betas / dim(x)[1]
}
# pre set parameters
p = dim(X_train)[2]
lambda = 0
n_iters <- 100
init_step_size <- 1e-6
set.seed(2021)
beta_init <- matrix(rnorm(p),nrow=p)
beta_path <- matrix(rep(0, n_iters * p), nrow = n_iters, ncol=p)
beta_path[1,] = beta_init
last_grad <- grad <- gradients(X_train, y_train, beta_path[1,], lambda)
beta_path[2,] = beta_init - init_step_size * grad
grad <- gradients(X_train, y_train, beta_path[2,], lambda)
losses <- rep(0, n_iters)
# iteratively update betas
for (i in 3:n_iters){
step_size <- as.numeric(t(beta_path[i - 1,] - beta_path[i - 2,]) %*% (grad - last_grad) /
(t(grad - last_grad) %*% (grad - last_grad)))
beta_path[i,] <- beta_path[i - 1,] - step_size * grad
last_grad <- grad
grad <- gradients(X_train, y_train, beta_path[i, ], lambda)
losses[i] <- log_loss(X_train, y_train, beta_path[i,], lambda)
}
# plot + results
ggplot(data.frame(step = 3:n_iters, loss=losses[3:n_iters])) +
geom_line(aes(x = step, y = loss)) +
ggtitle("Binary Crossentropy vs. Iterations")
pred_train <- as.numeric(1 / (1 + exp(-X_train %*% beta_path[100,])))
pred_test <- as.numeric(1 / (1 + exp(-X_test %*% beta_path[100,])))
results["sgd",] <- evaluate_metrics(pred_train, pred_test, df_train$target, df_test$target)
results["sgd",]
We apply our “from scratch” implementation of Logistic Regression using SGD with all the numerical variables, obtaining a test accuracy of \(0.751\) and a test AUC of \(0.7082\), a small improvement in terms of accuracy from the naive strategy, but a larger improvement in terms of AUC.
Remark: We note that our implementation of Logistic Regression is quite limited, for example an inability to process the factor datatype (unless we append a one-hot encoded matrix), or an inability to examine coefficient standard errors (unless we change to a Hessian based approach / Newton’s method). Given the limitations of our implementation, we shall subsequently use the glm package for logistic regression, and the glmnet package for regularised logistic regression.
Having explored the efficacy of logistic regression on this modelling task, we now utilise a more robust implementation of logistic regression via the glm package. We consider a more parsimonious logistic regression model, hand-picking several features based on our exploratory data analysis. We select the features that have discernible differences in their distributions conditioned on the target: the predictors cont3, cont4, cat13, cat18.
By selecting few predictors, coupled with the in-built features of the glm implementation, we can obtain, to some extent, interpretability and understanding of the model. This will help guide our analysis, and identify any potential issues before expanding to more predictors or different models. We will also use this as a baseline to compare the performance of subsequent models with.
glm_base <- glm(target~cont3+cont4+cat13+cat18, data = df_train, family=binomial(link="logit"))
summary(glm_base)
Call:
glm(formula = target ~ cont3 + cont4 + cat13 + cat18, family = binomial(link = "logit"),
data = df_train)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.5707 -0.6702 -0.6021 0.3931 2.0095
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.4821 1.0756 -1.378 0.16824
cont3 -0.6554 0.2276 -2.880 0.00398 **
cont4 -0.4772 0.2063 -2.313 0.02070 *
cat13B 1.8681 0.3346 5.584 2.36e-08 ***
cat18B 0.5581 1.0714 0.521 0.60243
cat18C 2.6081 1.0816 2.411 0.01590 *
cat18D 2.4342 1.0805 2.253 0.02427 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 3406.6 on 2999 degrees of freedom
Residual deviance: 2962.8 on 2993 degrees of freedom
AIC: 2976.8
Number of Fisher Scoring iterations: 4
vif(glm_base)
GVIF Df GVIF^(1/(2*Df))
cont3 1.110907 1 1.053996
cont4 1.053938 1 1.026615
cat13 1.010521 1 1.005247
cat18 1.066127 3 1.010729
ggcoef(glm_base)
Firstly, the predictors are almost all statistically significant at a 5% level, with the exception of cat18B and the intercept. Inspecting the coefficient plots, which visualise the confidence intervals of each coefficient estimate, we observe that the confidence intervals of all variables except cat18B are significant at 5% level, and do not include 0. We also see that the width of the confidence intervals of the coefficients of cat18D and cat18C, which are proportional to their variances, are much larger than that of cat13B, cont4 and cont3, however as the categorical and continuous variables are on completely different scales, these are not directly comparable. It could be that this is a result of subsampling, and that as we increase the sample size the standard errors of the coefficients would decrease.
Secondly, we inspect the generalised variance inflation factor (GVIF). The variance inflation factor captures the extent to which the standard error of a predictor is increased due to its correlation with other predictors in the model, corrected by the number of degrees of freedom corresponding to the number of levels in the categorical variables. The GVIF of all predictors in our model are less than 2, which suggests there is no evidence of multicollinearity. Although our analysis of the variance inflation factor suggests that there is no multicollinearity, a quick inspection of the variable cat18 indicates that there are very few observations in the level cat18A; thus cat18B would be approximately close to 1 - cat18C - cat18D and potentially collinear when the intercept is considered.s This suggests that a more robust method to encode the categorical variables could be required, such as merging infrequent categories, or perhaps using regularisation instead.
table(df_train$cat18)
A B C D
8 2584 194 214
Assuming that there is no multicollinearity, the coefficients for each predictor represents the association of that predictor with the log-odds of having target == 1: \(\log \left ( \frac{P(\text{target}=1)}{P\text{(target}=0)} \right )\).
The coefficients of the dummy variables indicate the average difference between the log odds of that factor level group compared to the baseline level group. In our regression, the first category (A) of each categorical variable is taken as the baseline group. For example, the coefficient of cat13B implies the following equation: \[\log\frac{P(target=1|cat13=B)}{P(target=0|cat13=B)}-\log\frac{P(target=1|cat13=A)}{P(target=0|cat13=A)}=1.8681\]
This suggests that claims with cat13=B have \(exp(1.8681)-1=5.48\) higher odds than claims with cat13=A. Given that the coefficient of cat18B is not statistically significant (or in other words, its standard error is high), this suggests that its association with target may not be significantly different from cat18=A (inspecting the conditional plot from our exploratory data analysis, their conditional distributions on the target appear to be similar). However, having cat18=C as opposed to having cat18=A is associated with a \(exp(2.6081)-1=12.6\) larger increase in the log-odds of having target==1. Similarly, claims with cat18=D are associated with a \(exp(2.4342)-1=9.4\) greater log-likelihood of having target==1.
To interpret the coefficients of the continuous variables, log odds need to be calculated using specific pairs of values of the predictors. The non-linearity of the logistic function means that a change of one unit in the value of a predictor is not the same across the range of the predictor. cont3 and cont4 have negative coefficients as expected, and are interpreted as follows. A one unit increase in cont3 is associated with an \(exp(-0.6554)=0.519\) multiplicative effect on the odds, i.e. a 48% decrease in odds compared to the previous odds. Similarly, a claim with a one unit higher value of ‘cont4’ is \(1 - exp(-0.4772)=37.9\%\) less likely to have target==1 than a claim with a one unit lower value of cont4. As the true range of each continuous variable before normalisation is unknown, it is unclear whether this associated change in target is considered large in reality.
pred_train <- predict(glm_base, df_train, type="response")
pred_test <- predict(glm_base, df_test, type="response")
results["glm-base",] <- evaluate_metrics(pred_train, pred_test, df_train$target, df_test$target)
results["glm-base",]
With just a few predictors, the baseline model has a test accuracy of \(80.3\%\), which is \(7.5\%\) higher than the test accuracy of the naive model. The test AUC of the baseline model is \(0.703\), which is 0.203 higher than that of the naive model, which is a fair improvement. However, we note that the Logistic Regression using SGD and all continuous features attained a higher test accuracy - this could suggest that there are potential gains from using all features.
We now run a logistic regression using all 31 predictors. The regression coefficients can be interpreted in a similar way as the baseline model. The regression output shows that all predictors are significant at 5%, but this is unreliable since the categorical variables have many levels (623 in total).
glm_full <- glm(target~., data=df_train, family=binomial(link="logit"),
control = list(maxit = 30))
glm.fit: algorithm did not convergeglm.fit: fitted probabilities numerically 0 or 1 occurred
pred_train <- predict(glm_full, df_train, type="response")
prediction from a rank-deficient fit may be misleading
glm_full$xlevels = lapply(df[,cat_feats], levels)
pred_test <- predict(glm_full, df_test, type="response")
prediction from a rank-deficient fit may be misleading
results["glm-full",] <- evaluate_metrics(pred_train, pred_test, df_train$target, df_test$target)
results["glm-full",]
The full model has a higher train accuracy (0.82) but lower test accuracy (0.536) than the baseline model,and the test AUC is 0.533, lower than our baseline. This may be indicative of overfitting, and an example of the bias-variance trade-off; the more complex, full model with 31 predictors has a lower bias but higher variance than the simpler, baseline model with 4 predictors.
We also obtain the errors: “prediction from a rank-deficient fit may be misleading” which and “glm.fit: algorithm did not converge,” which may be indicative of multicollinearity. This is likely due to the inclusion of all the categorical variables; with over 600 levels for all the categorical variables it is likely that they are collinear in some way. In subsequent sections, we aim to overcome this to improve the stability of our model using shrinkage in a subsequent section. We also briefly explore merging categories, as an alternative approach, which can be found in the Appendix
At this stage, we suspect that some outliers might be heavily influencing the performance of our models, and hence we seek to detect outliers using a few measures. Formally, outliers are defined as observations with a response vector that is unusual conditional on covariates (predictors). Firstly, we look for points with large (studentised) residuals. We can then test if these residuals are significantly larger than those of other points by examining the Bonferroni-adjusted p-values. Ten points are identified to have large studentised residuals with adjust p-values less than 0.05. We store the indices of these points to be removed later.
outlierTest(glm_full)
outliers <- as.numeric(names(outlierTest(glm_full)$p))
Observations that are far from the average covariate pattern are considered to have high leverage and can be measured using hat values. Here, there are at least 100 points with high leverage. Finally, we also measure for influence, which is an observation that is an outlier and have high leverage. These are likely to influence the regression coefficients and influence can be thought of as the product of leverage and outlier. Here, we plot studentised residuals against hat-values with the size of a circle being proportional to the Cook’s distance of an observation- a measure of influence.
influenceIndexPlot(glm_full, vars = "hat")
influencePlot(glm_full)
We observe that there are 4 observations with high influence. We remove these observations along with the points identified as outliers earlier (14 in total), and compare the performance of our updated model with the original model (see Appendix).
We find that removing outliers (see Appendix) leads to a higher test accuracy of 0.684 but a lower test AUC of 0.539. We cannot interpret this further without knowing the specific threshold used to classify the target in the data. Analysing the regression output, we see that leaving out the outliers does not change the coefficient estimates, and since we have no intuitive reason to believe that these outlier values are extreme (due to no knowledge about the variables themselves), we decided to keep these outlier data points for all other models. Furthermore, it is possible that these outliers may be less impactful as we increase the sample size.
Given the issue of high dimensionality and multicollinearity, we consider a regularised form of logistic regression. Specifically, we consider ridge (logistic) regression, with the functional form as in equation (2). This method shrinks coefficients by imposing a penalty on their size, in this case a \(L^{2}\) (ridge) penalty, thereby reducing model complexity. We use the implementation offered by the glmnet package; in doing so we need to convert the data type into matrices.The glmnet package requires the data to be in a matrix data type, and hence we make the corresponding adjustment.
X_train = model.matrix(~., df_train[, -length(df_train)])
y_train <- df_train$target
X_test = model.matrix(~., df_test[, -length(df_test)])
y_test <- df_test$target
set.seed(2021)
glm_reg <- cv.glmnet(X_train, y_train, family="binomial", alpha=0, keep=TRUE)
pred_train <- as.numeric(predict(glm_reg, X_train, type="response"))
pred_test <- as.numeric(predict(glm_reg, X_test, type="response"))
results["glm-ridge",] <- evaluate_metrics(pred_train, pred_test, df_train$target, df_test$target)
results["glm-ridge",]
plot(glm_reg)
cnf <- confusion.glmnet(glm_reg, newx=X_test, newy=y_test, family='binomial')
print(cnf)
True
Predicted 0 1 Total
0 704 136 840
1 24 136 160
Total 728 272 1000
Percent Correct: 0.84
rocs <- roc.glmnet(glm_reg, newx=X_test, newy=y_test)
qplot(rocs$FPR, rocs$TPR) + ggtitle("ROC Curve") + xlab("FPR") + ylab("TPR")
We set alpha = 0 for ridge. The cross-validated regularised logistic regression via glmnet identifies the optimal level of regularisation \(\lambda\), that minimises in this case binomial deviance (by default using nfolds=10). In this case the optimal \(log(\lambda)\) appears to be around \(-2\). We should note that the inclusion of the penalty term \(\lambda\) ensures that there is a solution exists (Hastie, T. and Tibshirani, R. and Friedman, J.H. 2009, 61–68), although the coefficients will be biased.
Remark: For further exploration we could consider using a \(L^{1}\) penalty (lasso) and both the \(L^{1}\) and \(L^{2}\) penalty terms (ElasticNet) by adjusting the alpha parameter. A \(L^{1}\) penalty (Lasso) would have the effect of shrinking some coefficients to \(0\), instead of near \(0\) as in ridge, which introduces sparsity and lends itself for feature selection.
Thus far we have examined the use of (regularised) general linear models, specifically logistic regression. We have seen that for this specific modelling problem, it is possible to attain a fair improvement upon naive predictions and simple baselines, and that they are relatively interpretable.
We can interpret logistic regression via the coefficients, with the interpretation being a linear association between the log odds of \(\beta_{i}\) as one unit of the predictor \(x_{i}\), assuming there are no issues with multicollinearity. In this setting, we are very explicitly aware of how the model generates its prediction: we can express our logistic ridge regression in a tractable closed form (albeit with hundreds of terms). Hence the logstic ridge regression is relatively interpretable compared to the models that we will explore in the subsequent sections.
In the susbequent questions, we consider the problem of predictive performance. For example, perhaps the implication of mispredicting a label target==1 would be a mispricing of an insurance policy, leading to some significant financial cost. In this case, there must be some tradeoff between predictive performance and interpretability which is what we shall subsequently explore.
We now examine whether predictive performance can be improved by using non-linear methods, specifically tree-based models such as random forests. We give a brief definition of decision trees: intuitively decision trees learn a set of “rules” for prediction by partitioning the feature space. As a simple example, consider \(\hat{y} = 0.5(x_{1} > 1)\), i.e. predict \(0.5\) if \(x_{1}\) is greater than 1.
Intuitively, a random forest is an average of different decision trees, with each decision tree in the forest being exposed to only a subset of features (known as bootstrap aggregation of bagging). This has the effect of reducing the likelihood of individual trees overfitting to the training data and the variance of the individual trees. Anecdotally, this is similar to taking the “wisdom of the crowds” (i.e. ensembling). For more details, refer to (Leo Breiman 2001) and (Hastie, T. and Tibshirani, R. and Friedman, J.H. 2009, 587–604).
We use the tidymodels package with the ranger backend to fit a Random Forest with hyperparameter tuning. We focus on two key parameters: min_n, the number of features subsampled by each decision tree, and mtry, the number of variables used for splitting at each node. Both of these hyperparameters influence model complexity and have the effect of regularising the model.
# Model Specification
rf_rec <- recipe(target~., data = df_train)
tune_spec <- rand_forest(
mtry = tune(),
trees = 100,
min_n = tune()
) %>%
set_mode("classification") %>%
set_engine("ranger", importance="impurity", seed=2021)
tune_wf <- workflow() %>%
add_recipe(rf_rec) %>%
add_model(tune_spec)
# Hyperparameter Tuning
set.seed(234)
trees_folds <- vfold_cv(df_train, v = 3)
doParallel::registerDoParallel()
set.seed(345)
tune_res <- tune_grid(
tune_wf,
resamples = trees_folds,
grid = 20
)
autoplot(tune_res)
# fit best model from hyperparameters
best_auc <- select_best(tune_res, "roc_auc")
final_rf <- finalize_model(
tune_spec,
best_auc
)
final_wf <- workflow() %>%
add_recipe(rf_rec) %>%
add_model(final_rf)
rf1 <- fit(final_wf, df_train)
# plot feature importances
ranger_obj <- pull_workflow_fit(rf1)$fit
rf_feat_imp <- rownames_to_column(data.frame(ranger_obj$variable.importance));
names(rf_feat_imp) <- c("feat", "importance")
ggplot(rf_feat_imp, aes(x = reorder(feat, importance), y = importance))+
geom_bar(stat="identity", position="dodge")+ coord_flip()+
ylab("Feature Importance (Gini Impurity)")+
xlab("Feature")+
ggtitle("Random Forest Feature Importances")
#evaluate metrics
pred_train <- unlist(predict(rf1,df_train, type='prob')[,2])
pred_test <- unlist(predict(rf1, df_test, type='prob')[,2])
results["rf",] <- evaluate_metrics(pred_train, pred_test, df_train$target, df_test$target)
results["rf",]
temp <- function(x){partial(ranger_obj, train=df_train, pred.var = x, plot = TRUE, plot.engine = "ggplot2", paropts = list(.packages = "ranger"))}
plots <- lapply(names(df_train)[cont_feats], temp)
wrap_plots(plots)
The random forest has a test accuracy of 0.837 and a test AUC of 0.883, which is higher than the previous linear models. However, the train accuracy and AUC are significantly higher than the test performance, which suggest that there may be some degree of overfitting to the train data.
Random forests can be used to rank the importance of different features. Specifically, the x-axis is the Mean Decrease Accuracy, which reports how much accuracy the model loses when we exclude this variable. The more the accuracy falls by, the more important the particular variable is. Note here that for categorical variables, each level of the category is classified as a single variable. In this plot, we recorded the 30 most important variables.
Finally, for model interpretability we can inspect the Random Forest model predictions using Partial Dependence Plots, which plots how the predictions varies as the predictors varies. For example, for cont4, holding all other variables constant, outcome increases as we start to increase the values of this variable. The outcome then plateaus and eventually falls as we further increase values of `cont4’.
For completeness, we also consider the xgboost library for gradient boosted decision trees. The XGBoost package, introduced in (Chen, Tianqi and Guestrin, Carlos 2016) is a variant of Gradient Boosted Decision Trees (Hastie, T. and Tibshirani, R. and Friedman, J.H. 2009, 353–74). For Gradient Boosting Trees, each decision tree is trained sequentially on the residuals of the previous decision tree, which has the effect of reducing bias, and the overall prediction is a linear combination of these individual decision trees.
Similar to Random Forests, the hyperparameters influence the model complexity of the GBDT. In this case we focus on two key hyperparameters: the maximum depth of each tree max_depth, and the learning_rate. We keep the the number of trees or “boosting rounds” fixed nround = 100 for faster training, although this could be a hyperparameter to tune as well.
y_train = as.integer(as.matrix(df_train$target))
y_test = as.integer(as.matrix(df_test$target))
bst <- xgboost(data = X_train, label=y_train, max_depth = 2, nround = 10,
verbose=0,
objective='binary:logistic',
eval_metric="logloss")
pred_train <- predict(bst, X_train, type="response")
pred_test <- predict(bst, X_test, type="response")
results["xgb",] <- evaluate_metrics(pred_train, pred_test, df_train$target, df_test$target)
results["xgb",]
importance_matrix <- xgb.importance(model=bst)
xgb.plot.importance(importance_matrix)
# shapley values
xgboost::xgb.ggplot.shap.summary(X_test, model = bst,
target_class = 1, top_n = 20) # Summary plot
We obtain a test accuracy of \(0.827\) and test AUC of \(0.8627\). In this case, our train and test statistics seem comparable, although it is likely that further improvements can be made by further hyperparameter tuning.
For model interpretability we can use Feature Importances. In the case of XGBoost, the default feature importance is the gain, which measures the contribution of the feature to the prediction. We can also use to Shapley Values (Lundberg, Scott and Lee, Su-In 2017), another means of capturing how each feature contributes to each prediction which relies on concepts from game theory. As an example, from the Shapley Value plot, we can see that higher values of cat16B (which would imply that cat16 == B) is associated with a greater Shapley Value and impact on the prediction.
We also briefly examine the catboost library for gradient boosted decision trees, given its popularity on machine learning competitions such as Kaggle. The CatBoost library has the advantage of learning a target encoding, i.e. some numerical value for every category for each categorical variables. From an implementation point of view, this may reduce the preprocessing that may be required.For details on CatBoost, refer to (Prokhorenkova, Liudmila and Gusev, Gleb and Vorobev, Aleksandr and Dorogush, Anna Veronika and Gulin, Andrey 2017).
Again, the key parameters are the number of trees (iterations), and the depth of each tree. Due to computational limitations, we select arbitrary parameters iterations = 10 and depth = 8; these could be further fine-tuned using hyperparameter optimisation.
# Define Model Specification
cb_spec <- boost_tree(
trees = 100,
tree_depth = tune(),
learn_rate = tune() ## step size
) %>%
set_engine("catboost") %>%
set_mode("classification")
cb_wf <- workflow() %>%
add_formula(target ~.) %>%
add_model(cb_spec)
# Hyperparameter Tuning
set.seed(234)
trees_folds <- vfold_cv(df_train, v = 3)
doParallel::registerDoParallel()
set.seed(345)
cb_res <- tune_grid(
cb_wf,
resamples = trees_folds,
grid = 5
)
autoplot(cb_res)
# Fit best model
best_cb_auc <- select_best(cb_res, "roc_auc")
final_cb <- finalize_workflow(
cb_wf,
best_cb_auc
)
cb <- fit(final_cb, df_train)
cb_model <- pull_workflow_fit(cb)$fit
# Evaluate Metrics
X_train = df_train[,c(cat_feats, cont_feats)]
X_test = df_test[,c(cat_feats, cont_feats)]
y_train = as.integer(df_train$target) - 1
y_test = as.integer(df_test$target) - 1
pool <- catboost.load_pool(X_train, y_train, cat_features = cat_feats)
pred_train <- catboost.predict(cb_model, catboost.load_pool(X_train), prediction_type = 'Probability')
pred_test <- catboost.predict(cb_model, catboost.load_pool(X_test), prediction_type = 'Probability')
# Feature importance
feat_importance <- catboost.get_feature_importance(cb_model, pool)
importances <- data.frame(feat_importance[order(feat_importance, decreasing=FALSE),])
importances$features = rownames(importances)
names(importances) <- c("importance","features")
importances$features <- factor(importances$features, level=importances$features)
ggplot(importances, aes(x=importance, y=features)) +
geom_bar(stat="identity") +
ggtitle("CatBoost Feature Importance")
#Shapley Values
data_shap_tree <- catboost.get_feature_importance(cb_model, pool = pool,
type = "ShapValues")
data_shap_tree <- data.frame(data_shap_tree[, -ncol(data_shap_tree)])
names(data_shap_tree) = names(df[, c(cat_feats, cont_feats)])
# ggplot(stack(data_shap_tree), aes(x = ind, y = values)) +
# geom_point(aes(color = values)) + coord_flip() +
# ggtitle("Shapley Values by variable") + scale_color_viridis_c()
results["catboost",] <- evaluate_metrics(pred_train, pred_test, df_train$target, df_test$target)
results["catboost",]
We obtain a train accuracy of \(0.766\) and a train AUC of \(0.866\), versus a test accuracy of $0.753 $ and test AUC \(0.871\), and the train and test statistics are similar. From our hyperparameter tuning graph, it appears as if increasing tree_depth and the learning_rate do not really impact accuracy, but increasing either improves AUC. However, this could be an error or consuquence from using only 3 cross-validation folds, or using a small subsample of the larger dataset. It is possible that further hyperparameter tuning can be done to improve predictive performance.
For model interpretability, we can use CatBoost’s feature importance. From the CatBoost documentation, CatBoost’s in-built and default feature importance does the following:
For each feature, PredictionValuesChange shows how much on average the prediction changes if the feature value changes. The bigger the value of the importance the bigger on average is the change to the prediction value, if this feature is changed.
We could also set the type of feature importance to Shapley Values, although we omit the plot in this case due to the lack of in-built support for plotting.
results[order(results$test_auc, results$test_acc),]
Here, we ordered the list of models in ascending performance on the test data. We observe that complex models (random forests, xgboosts, catboosts) do not necessarily perform the best on the test data. Rather, the ridge logistic regression performs the best in terms of test AUC. Arguably, it is a simpler model than the tree-based models as it still relies on a logistic regression framework while shrinking coefficients towards 0.
Interestingly, we also observe a high degree of overfitting by the more complex models as they (especially the random forest) fit very well on the training data but do not generalise very well on an out-of-sample test data. Even though the ridge regression does not fit the train data as well, it is able to generalise extremely well to the test data and hence achieve the overall highest test AUC.
Furthermore, although the tree-based methods are less interpretable and more “black box” to some extent, we can make use of model interpretability techniques. In addition, it could be that our results are not fully robust, given we have subsampled the original data. Complex models tend to do better when the number of observations is large; hence, it is plausible that these will perform better if we use the entire data set.
In this project, we explored the use of logistic regression, gradient descent, regularisation, random forest, gradient boosting and hyperparameter tuning techniques. We compared and constrated the various models in terms of their performances in terms of AUC and accuracy, and their relative interpretability. From the models we explored, regularised logistic regression was the model with highest AUC, which suggests that a relatively interpretable model can also attain a fair amount of predictive performance, particularly when computational effort is considered.
On the other hand, it is possible that a higher predictive performance could be attained by further fine-tuning a non-linear, less interpretable model, for example using a deep neural network. We note that the winners of this contest made use of a denoising auto-encoder architecture (see here, which could be a potential avenue for further exploration. In practice, depending on whether the goals of the company lean towards prediction or explanation, an appropriate trade-off between model complexity and predictive performance vs interpretability has to be chosen.
From the insurer’s profit perspective, understanding which characteristics are highly correlated with large amounts of claims could also be used to inform whether or not a particular insurance application is accepted, even if causality cannot be inferred. We should note that despite the the regularised logistic regression being relatively interpretable, it is still relatively difficult to interpret, given the large number of predictors (when all levels of the categorical variable are considered), compared to our baseline model with few predictors This leads into discussion of the legal and ethical aspects of machine learning. For example, the UK’s 2010 Equality Act makes discriminating on nine protected characteristics, including age, gender, race and religion, illegal (Shini Pattni 2020). Therefore, proper data governance needs to be maintained and privacy impact assessments need to be conducted before any machine learning models of this kind are deployed in industry (IT Governance Ltd. 2021).
We investigate the impact of removing outliers (based on those identified by our logistic regression specification) on the coefficients. We find that there is no signiica
influencers <- as.numeric(rownames(influencePlot(glm_full)))
glm_full_influencers <- update(glm_full, subset = c(-influencers))
glm_full_outliers <- update(glm_full, subset = c(-outliers))
removal_list <- union(outliers, influencers)
glm_full_removed <- update(glm_full, subset = c(-removal_list))
compareCoefs(glm_full, glm_full_influencers, glm_full_outliers, glm_full_removed)
We briefly experiment with grouping infrequently occuring categories with the step_order preprocessing from tidymodels. We obtain a greater test AUC, and it appears that the training and test scores are similar. It is possible that the predictive performance could be further improved by adjusting the threshold; it appears that some information is lost when merging the categories, in particular comparing with the ridge logistic regression.
mod <- logistic_reg() %>% set_engine("glm", maxit=50)
glm_recipe <- recipe(target~.,data = df_train) %>%
step_other(all_nominal(), -all_outcomes(), threshold=0.05)
glm_workflow<- workflow() %>%
add_recipe(glm_recipe) %>%
add_model(mod)
glm <- fit(glm_workflow, df_train)
pred_train <- as.numeric(unlist(predict(glm, df_train, type='prob')[, 2]))
pred_test <- as.numeric(unlist(predict(glm, df_test, type='prob')[, 2]))
results["glm-merged-cats",] <- evaluate_metrics(pred_train, pred_test, df_train$target, df_test$target)
results["glm-merged-cats",]
Remark: We were unable to get the inbuilt xgb.ggplot.shap.summary to work with tidymodels.
set.seed(2021)
xgb_spec <- boost_tree(
trees = 100,
tree_depth = tune(),
learn_rate = tune()
) %>%
set_mode("classification") %>%
set_engine("xgboost", objective='binary:logistic', eval_metric="logloss")
tune_wf <- workflow() %>%
add_formula(target~.) %>%
add_model(xgb_spec)
# Hyperparameter Tuning
set.seed(234)
trees_folds <- vfold_cv(df_train, strata = target, v = 3)
doParallel::registerDoParallel()
set.seed(345)
tune_res <- tune_grid(
tune_wf,
resamples = trees_folds,
grid = 2
)
autoplot(tune_res)
# fit best model from hyperparameters
best_auc <- select_best(tune_res, "roc_auc")
final_xgb <- finalize_model(
xgb_spec,
best_auc
)
final_wf <- workflow() %>%
add_formula(target ~.) %>%
add_model(final_xgb)
xgb1 <- fit(final_wf, df_train)
xgb2 <- pull_workflow_fit(xgb1)$fit
pred_train <- unlist(predict(xgb1,df_train, type='prob')[,2])
pred_test <- unlist(predict(xgb1, df_test, type='prob')[,2])
results["xgb",] <- evaluate_metrics(pred_train, pred_test, df_train$target, df_test$target)
importance_matrix <- xgb.importance(model=xgb2)
xgb.plot.importance(importance_matrix)